Bootstrapping structured page segmentation
Authors
Abstract
In this paper, we present a bootstrapping approach to learning a page segmentation model. The idea evolved from attempts to segment dictionaries, which often have a consistent page structure, and is extended to the segmentation of more general structured documents. When the structure is highly regular, the layout can be learned from examples of only a few pages. The system is first trained using a small number of samples, and a larger test set is processed based on the training result. After corrections are made to a selected subset of the test set, the corrected samples are combined with the original training samples to generate bootstrap samples. The newly created samples are used to retrain the system, refine the learned features, and resegment the test samples. This procedure is applied iteratively until the learned parameters are stable. With this approach, a large training set is not needed initially, and bootstrapping refines the results step by step. We have applied this segmentation to many structured documents, such as dictionaries, phone books, and spoken language transcripts, and obtained satisfactory segmentation performance.
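The train, segment, correct, and retrain loop described in the abstract can be sketched roughly as follows. This is a toy illustration only, not the authors' system: the single "indent threshold" feature, the function names `train`, `segment`, and `bootstrap`, the `oracle` callback standing in for manual correction, and the choice to simply pool corrected samples with the existing training set are all assumptions made for the example.

```python
from statistics import mean


def train(samples):
    """Learn one parameter: an indent threshold between entry starts and continuations.

    `samples` is a list of (indent, label) pairs with label "start" or "cont".
    Both labels must be present (hypothetical toy model, not the paper's features).
    """
    starts = [x for x, label in samples if label == "start"]
    conts = [x for x, label in samples if label == "cont"]
    return (mean(starts) + mean(conts)) / 2.0


def segment(threshold, indents):
    """Label each line by comparing its indent with the learned threshold."""
    return ["start" if x < threshold else "cont" for x in indents]


def bootstrap(seed, test_indents, oracle, batch=2, tol=0.01, max_rounds=20):
    """Iterate: segment, correct a subset, pool with training data, retrain.

    `oracle(indent, guess)` simulates a person correcting one predicted label.
    Stops when the threshold changes by less than `tol` between rounds.
    """
    training = list(seed)
    threshold = train(training)
    for round_no in range(max_rounds):
        predicted = segment(threshold, test_indents)
        lo = round_no * batch
        hi = min(lo + batch, len(test_indents))
        if lo >= hi:
            break  # every test sample has already been reviewed
        # Correct the selected subset and add it to the training pool.
        corrected = [(test_indents[i], oracle(test_indents[i], predicted[i]))
                     for i in range(lo, hi)]
        training.extend(corrected)
        new_threshold = train(training)
        stable = abs(new_threshold - threshold) < tol
        threshold = new_threshold
        if stable:
            break
    return threshold, segment(threshold, test_indents)


if __name__ == "__main__":
    seed = [(0.0, "start"), (4.0, "cont")]
    pages = [0.5, 3.5, 1.0, 4.5, 0.2, 3.8]
    truth = lambda x, guess: "start" if x < 2.5 else "cont"
    print(bootstrap(seed, pages, truth))
```

The stopping rule mirrors the "until the learned parameters are stable" condition in the abstract; a real system would track many layout features rather than a single threshold.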
Similar resources
Persian Printed Document Analysis and Page Segmentation
This paper presents a hybrid method, combining low-resolution and high-resolution analysis, for Persian page segmentation. In the low-resolution page segmentation, a pyramidal image structure is constructed for multiscale analysis and segments the document image into a set of regions. In the high-resolution page segmentation, connected components analysis segments each region into homogeneous regions and identifyi...
Title of dissertation: ADAPTIVE ANALYSIS AND PROCESSING OF STRUCTURED MULTILINGUAL
Title of dissertation: ADAPTIVE ANALYSIS AND PROCESSING OF STRUCTURED MULTILINGUAL DOCUMENTS Huanfeng Ma, Doctor of Philosophy, 2006 Dissertation directed by: Professor Rama Chellappa Dr. David S. Doermann Electrical and Computer Engineering Department Digital document processing is becoming popular for applications to office and library automation, bank and postal services, publishing houses a...
Poorly Structured Handwritten Documents Segmentation using Continuous Probabilistic Feature Grammars
This work deals with the segmentation of poorly structured handwritten documents, such as pages of handwritten notes produced with pen-based interfaces. We propose to use a formalism, based on Probabilistic Feature Grammars, that exhibits some interesting features. It allows handling ambiguities and taking into account contextual information such as spatial relations between objects in the page.
Chinese word segmentation model using bootstrapping
We participated in the CIPS-SIGHAN2010 bake-off task of Chinese word segmentation. Unlike the previous bake-off series, the purpose of the 2010 bake-off is to test the cross-domain performance of Chinese segmentation models. This paper summarizes our approach and our bake-off results. We mainly propose to use χ statistics to increase the OOV recall and use bootstrapping strategy to increase the overa...
Reverse Engineering Method of Web Application to UML Presentation Model Using Vision Based Segmentation Method
In recent years, many web applications are available to use. Most of these applications are poorly modeled or not modeled at all. One of the main modeling techniques is presentation modeling in which the layout of the page is shown. In this paper we present a new reverse engineering method, which takes a web page as input and returns a UML presentation model that represents the page. We applied...